ABSTRACT

With the immense growth in the number of available protein structures,
fast and accurate structure comparison has become essential. We propose
an efficient method for structure comparison, based on a structural
alphabet. Protein Blocks (PBs) is a widely used structural alphabet with
16 pentapeptide conformations that can fairly approximate a complete
protein chain. Thus a 3D structure can be translated into a 1D sequence
of PBs. With a simple Needleman-Wunsch approach and a raw PB
substitution matrix, PB-based structural alignments were better than
those of many popular methods. The iPBA web server presents an improved
alignment approach using (i) specialized PB Substitution Matrices (SM)
and (ii) an anchor-based alignment methodology. With these developments,
the quality of ~88% of alignments was improved. iPBA alignments were
also better than those of DALI, MUSTANG and GANGSTA+ in >80% of the
cases. The web server is designed for both pairwise comparisons and
database searches. Outputs are given as a sequence alignment and as
superposed 3D structures displayed using PyMol and Jmol. A local
alignment option for detecting sub-structural similarity is also
embedded. We believe that this fast and efficient 'sequence-based'
structure comparison tool will be quite useful to the scientific
community. iPBA can be accessed at
http://www.dsimb.inserm.fr/dsimb_tools/ipba/.

INTRODUCTION 

The continuous increase in the number of 3D structures of proteins
necessitates the development of efficient tools for structure
comparison. Such developments facilitate the characterization of the
function of a protein of known structure (1) or aid in evolutionary
studies (2-4). Considering the complexity involved in obtaining an
optimal superposition solely by global structural searches, a large
majority of structural alignment approaches focus on optimizing a
combination of local segments of similarity to derive the global
alignment (5-7). Many of the most recent approaches consider the match
between secondary structural elements (8-10) while others are fragment
based (11-16). This idea has been extended further to investigate the
flexibility of protein structures (17,18).

Local backbone conformations such as α-helices, β-strands, β-turns and
PPII helices characterize a large part of the tertiary structure of a
protein chain. A complete protein backbone can be approximated with a
limited set of local conformations. Such a collection of local
structural prototypes is called a Structural Alphabet (SA). Protein
Blocks (PBs) (19-21) is one such SA, involving 16 pentapeptide
conformations (represented by the letters a to p), characterized by
backbone dihedral angles. Several biological questions can be addressed
using this PB-based abstraction.

The main chain 3D information can be represented as a sequence in 1D,
using PBs. This reduces the problem of protein structural comparison to
a classical sequence alignment. Dynamic programming algorithms like
Needleman-Wunsch (22) and Smith-Waterman (23) were used earlier for PB
alignment, and a PB substitution matrix was generated for scoring the
alignment (24-26). We propose an improved and novel version of PB
alignment using (i) specialized substitution matrices for pairwise
alignment and database search and (ii) an anchor-based dynamic
programming algorithm. Most of the recent web tools for structure
comparison are dedicated either to database searches (9,10,13,27,28) or
to pairwise structural alignments (29-32). As an efficient tool for both
pairwise alignments and database searches, this web server serves as a
good platform for such studies. A local alignment strategy for motif or
sub-structure search is also available. The server provides: (i)
different scoring schemes to indicate the quality of the alignment, (ii)
a user-friendly interface to view and analyze the 3D superposition and
(iii) downloadable alignment files (both sequence and structural
alignment).



MATERIALS AND METHODS 

The server can be used to search for structural relatives of a query
protein (Figure 1A) or to compare two protein structures (Figure 1B). In
both cases, the user can decide whether to carry out alignments for the
complete structure (global) or to look for the best local similarity
(local).

Input 

For comparing two structures, the user can either provide the
coordinates in the standard PDB format or enter the PDB code. The
identifiers of the chains to be compared should also be given. For
searching for related protein structures in the database, only one PDB
file or code is necessary (Figure 1A and B).
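
As an illustration of the kind of input handling this implies, the sketch below extracts the backbone atoms (N, CA, C) of one chain from PDB-format text, relying on the fixed-column layout of ATOM records. The function name `read_backbone` and its choices (first model only, first alternate location only, HETATM records ignored) are assumptions made for this example and do not describe the server's actual parser.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[float, float, float]


def read_backbone(pdb_text: str, chain_id: str) -> List[Dict[str, Coord]]:
    """Collect N, CA and C coordinates per residue for one chain.

    Hypothetical helper: reads only ATOM records, stops at the first
    ENDMDL (first model of a multi-model entry) and keeps only the
    blank or 'A' alternate location.
    """
    residues: List[Dict[str, Coord]] = []
    current_key: Optional[str] = None
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        if line[21] != chain_id:           # column 22: chain identifier
            continue
        atom_name = line[12:16].strip()    # columns 13-16: atom name
        if atom_name not in ("N", "CA", "C"):
            continue
        if line[16] not in (" ", "A"):     # column 17: alternate location
            continue
        key = line[22:27]                  # residue number + insertion code
        if key != current_key:
            residues.append({})
            current_key = key
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        residues[-1][atom_name] = xyz
    return residues
```

Calling `read_backbone(text, "A")` returns one dictionary per residue of chain A, in file order, which is the minimum needed to compute backbone dihedral angles.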

Pre-processing 

Atomic coordinate sets are first translated into sequences of PBs
(Figure 1C). PBs constitute 16 pentapeptide conformations (labeled from
a to p), each described by a series of φ, ψ dihedral angles. They give a
reasonable approximation of local structures (19), with a root mean
square deviation (RMSD) of 0.42 Å (33).
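
The translation step can be pictured as a nearest-prototype assignment over dihedral angles. The sketch below assumes that each five-residue window is described by eight ψ/φ angles and that the closest prototype is chosen by an angular RMSD; both points, and every name in the code, are assumptions for illustration. The two prototypes are placeholder values (a helix-like and a strand-like conformation), not the 16 published PB definitions.

```python
import math
from typing import Dict, List, Sequence

# Placeholder prototypes (degrees), ordered psi, phi, psi, phi, ...
# These are NOT the published PB angle values.
TOY_PROTOTYPES: Dict[str, List[float]] = {
    "H": [-45.0, -60.0] * 4,    # helix-like toy prototype
    "E": [135.0, -120.0] * 4,   # strand-like toy prototype
}


def angle_diff(a: float, b: float) -> float:
    """Signed difference a - b wrapped into [-180, 180) degrees."""
    return (a - b + 180.0) % 360.0 - 180.0


def rmsda(window: Sequence[float], prototype: Sequence[float]) -> float:
    """Root mean square deviation on angular values."""
    total = sum(angle_diff(a, b) ** 2 for a, b in zip(window, prototype))
    return math.sqrt(total / len(prototype))


def assign_block(window: Sequence[float],
                 prototypes: Dict[str, List[float]] = TOY_PROTOTYPES) -> str:
    """Return the label of the prototype closest to an 8-angle window."""
    return min(prototypes, key=lambda name: rmsda(window, prototypes[name]))


def encode(dihedrals: Sequence[float],
           prototypes: Dict[str, List[float]] = TOY_PROTOTYPES) -> str:
    """Encode a flat list [psi1, phi2, psi2, phi3, ...] as a letter string.

    The window advances by two angles, i.e. by one residue.
    """
    size = 8
    letters = [assign_block(dihedrals[start:start + size], prototypes)
               for start in range(0, len(dihedrals) - size + 1, 2)]
    return "".join(letters)
```

Wrapping the angle difference matters because dihedral angles are periodic: 179° and -179° differ by 2°, not by 358°.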

Computing pairwise alignment 

The alignment method implemented in this server represents a significant
improvement over our earlier work (24). In the previous work, the PB
substitution matrix was generated from pairwise alignments in the PALI
database (3). This database was redundant in terms of the distribution
of related proteins. We have therefore refined the databank: the PB
substitutions were calculated from a non-redundant subset sharing
sequence identity <40%, and a refined substitution matrix was generated.
Also, in our previous approach, a simple Needleman-Wunsch (22) algorithm
was used for alignment. Protein structural homologues are often
characterized by conserved stretches separated by variable regions.
Hence a combination of local and global alignment is expected to give a
better performance.
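
To make the idea of a substitution matrix derived from aligned PB pairs concrete, the sketch below computes log-odds scores from pair counts. The log-odds form, the pseudocount and the function name `log_odds_matrix` are assumptions for illustration; the text does not specify the exact formula used to build the iPBA matrices.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def log_odds_matrix(aligned_pairs: Iterable[Tuple[str, str]],
                    pseudocount: float = 1.0) -> Dict[Tuple[str, str], float]:
    """Symmetric log-odds scores from aligned letter pairs (illustrative)."""
    pair_counts: Counter = Counter()
    for a, b in aligned_pairs:
        # Count both orientations so that the matrix is symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    letters = sorted({x for pair in pair_counts for x in pair})
    n = len(letters)
    total = sum(pair_counts.values()) + pseudocount * n * n
    # Background frequency of each letter (row marginal of the pair table).
    freq = {a: sum(pair_counts[(a, b)] + pseudocount for b in letters) / total
            for a in letters}
    matrix: Dict[Tuple[str, str], float] = {}
    for a in letters:
        for b in letters:
            observed = (pair_counts[(a, b)] + pseudocount) / total
            matrix[(a, b)] = math.log(observed / (freq[a] * freq[b]))
    return matrix
```

With such a scheme, pairs seen more often than expected from the background frequencies get positive scores and rarely observed substitutions get negative ones. Removing redundancy from the source alignments keeps over-represented families from dominating the counts.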

A set of local alignments (anchors) between the two PB sequences being
compared is derived using a modified version of the SIM algorithm (34).
The remaining segments between anchors (linkers) are then aligned using
the Needleman-Wunsch algorithm (Figure 1C). Affine gap penalties are
used for the anchor and linker alignments. Distance constraints on the
structures are included to identify false anchors. The different
parameters were optimized as in the previous work, based on alignments
of proteins in the PALI data set (3). A total of 80% of the alignments
were better than those obtained with our previous work (24).
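
The anchor-and-linker idea can be sketched as follows. This is a simplified illustration, not the server's algorithm: anchors are taken as given, ungapped segments instead of being found with SIM, a linear gap penalty replaces the affine one, no distance constraints are applied, and `toy_score` is a placeholder for a lookup in a PB substitution matrix. All names are hypothetical.

```python
from typing import Callable, List, Tuple

Score = Callable[[str, str], float]
Anchor = Tuple[int, int, int]  # (start in a, start in b, length), ungapped


def toy_score(x: str, y: str) -> float:
    """Placeholder for a PB substitution matrix lookup."""
    return 2.0 if x == y else -1.0


def needleman_wunsch(a: str, b: str, score: Score = toy_score,
                     gap: float = -2.0) -> Tuple[str, str]:
    """Global alignment with a linear gap penalty."""
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, len(b) + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Traceback from the bottom-right corner.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def align_with_anchors(a: str, b: str, anchors: List[Anchor],
                       score: Score = toy_score,
                       gap: float = -2.0) -> Tuple[str, str]:
    """Keep anchors fixed and align the linkers between them globally."""
    out_a: List[str] = []
    out_b: List[str] = []
    pos_a = pos_b = 0
    for start_a, start_b, length in sorted(anchors):
        if start_a < pos_a or start_b < pos_b:
            continue  # skip anchors that overlap or cross an accepted one
        link_a, link_b = needleman_wunsch(a[pos_a:start_a], b[pos_b:start_b],
                                          score, gap)
        out_a += [link_a, a[start_a:start_a + length]]
        out_b += [link_b, b[start_b:start_b + length]]
        pos_a, pos_b = start_a + length, start_b + length
    tail_a, tail_b = needleman_wunsch(a[pos_a:], b[pos_b:], score, gap)
    return "".join(out_a + [tail_a]), "".join(out_b + [tail_b])
```

Fixing the anchors first confines gaps to the variable regions between conserved stretches, which is the motivation given above for combining local and global alignment.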

Different scores are used to quantify the quality of PB 
alignment: 

The dynamic programming alignment score:
Aln_Score = Alignment score / Alignment length

A score similar to the Global Distance Test Total Score (GDT_TS) (35)
for PB sequence alignment, derived using seven decreasing cut-offs of PB
substitution scores (similar to the distance cut-offs for GDT_TS).



$$\mathrm{GDT\_PB} = \frac{\sum_{j=1}^{k} (k - j + 1)\, P_j}{k(k + 1)/2}$$



where k corresponds to the total number of thresholds used, i.e. 7, and
P_j is the percentage of PB substitutions that are within the cut-off
level j. The residue equivalences from the PB alignment then guide the
3D fitting of the structures by ProFit (36)
(http://www.bioinf.org.uk/software/profit/), which reports the RMSD and
the number of aligned residues (within 5 Å) (Figure 2). The GDT_TS score
for the alignment is also provided along with the Aln_Score and GDT_PB.
Note that the GDT_TS score used for comparison of iPBA with other web
tools (Table 1) was computed with a maximum distance threshold of 5 Å.
The percentage of equivalent residues was calculated from only one of
the protein lengths. These variations were included to avoid bias in the
score due to the different distance thresholds used by different methods
and also due to incomplete alignment outputs provided by the servers.
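
The three scores described in this section can be sketched in a few lines. The cut-off values in `TOY_CUTOFFS` are hypothetical, "within the cut-off level j" is read here as "substitution score at least cut-off j", and the GDT_TS-like function assumes a plain average over thresholds spaced 0.5 Å apart up to 5 Å (the spacing mentioned in the footnote of Table 1). None of these choices is confirmed by the text; they only serve to show how the weighting works.

```python
from typing import Sequence

# Hypothetical cut-offs on PB substitution scores, strictest first.
TOY_CUTOFFS = (3.0, 2.5, 2.0, 1.5, 1.0, 0.5, 0.0)


def aln_score(alignment_score: float, alignment_length: int) -> float:
    """Dynamic programming score normalized by alignment length."""
    return alignment_score / alignment_length


def gdt_pb(pair_scores: Sequence[float],
           cutoffs: Sequence[float] = TOY_CUTOFFS) -> float:
    """Weighted percentage of aligned PB pairs passing each cut-off."""
    k = len(cutoffs)
    weighted = 0.0
    for j, cutoff in enumerate(cutoffs, start=1):
        # P_j: percentage of aligned pairs scoring at least cut-off j.
        p_j = 100.0 * sum(s >= cutoff for s in pair_scores) / len(pair_scores)
        weighted += (k - j + 1) * p_j
    return weighted / (k * (k + 1) / 2)


def gdt_ts_like(distances: Sequence[float], reference_length: int,
                step: float = 0.5, max_cutoff: float = 5.0) -> float:
    """Mean percentage of residues fitted within each distance threshold.

    Percentages are taken relative to the length of one protein only.
    """
    n_cutoffs = int(round(max_cutoff / step))
    cutoffs = [step * (i + 1) for i in range(n_cutoffs)]
    fractions = [sum(d <= c for d in distances) / reference_length
                 for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)
```

Because the weights k - j + 1 decrease with j and sum to k(k + 1)/2, pairs that pass the strictest cut-off contribute most, and GDT_PB stays between 0 and 100.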

Database search 

A sequence of PBs can also be used to search for structurally related
proteins in a data set of structures (Figure 1A). SCOP version 1.75 (37)
is used as the structure data set and the user can also search refined
subsets derived at different sequence identity cut-offs. The top 100
hits are reported based on the PB alignment score, which is scaled to
values between -13 and 17. Values >1.5 are generally associated with
high confidence. GDT_PB scores are also provided for the hits obtained.
To keep the search fast, structure-based refinements are not included.
The user can carry out further alignments of the hits obtained
(Figure 1A and B).
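
The reporting step described here amounts to ranking and thresholding. The following minimal sketch assumes that the scaled scores have already been computed; the `Hit` type, the function name and the domain identifiers used with it are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    domain_id: str       # identifier of a database structure (hypothetical)
    scaled_score: float  # scaled PB alignment score


def top_hits(hits: Iterable[Hit], limit: int = 100,
             confident_above: float = 1.5) -> List[Tuple[str, float, bool]]:
    """Rank hits by scaled score and flag those above the confidence level."""
    ranked = sorted(hits, key=lambda h: h.scaled_score, reverse=True)[:limit]
    return [(h.domain_id, h.scaled_score, h.scaled_score > confident_above)
            for h in ranked]
```

Each returned tuple carries the identifier, the score and a flag that is true only for scores strictly greater than 1.5.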

Output for pairwise alignments 

With the help of the Jmol applet, users can analyze the superposed
structures in 3D and choose different visual representations of the
structures (Figure 1D). Images of the aligned structures rendered in
PyMol are also provided. The residue equivalences in the 3D alignment
are given as a complete sequence alignment. The corresponding PBs are
also shown in the alignment. PB stretches of high similarity, identified
as anchors, are highlighted (Figure 1D). The user can download the
coordinates of the aligned structures in PDB format and PyMol scripts
for local analysis of the superposition. A raw output file with
Figure 1. The framework of iPBA and underlying methods. The user can either compare two structures or search for structural neighbors (mining) in
a databank. The input and output web interfaces for pairwise structural alignment are highlighted with a blue background. The web interfaces for
mining have a green background. The rest of the figure (white background) gives the outline of the underlying methodological aspects. (A) Search for
structurally similar proteins in a 3D database. (B) Compare two protein structures. (C) Alignment approach. (D) Main outputs.
Figure 2. Comparison of iPBA with other rigid-body alignment methods. The 3D superposition of nucleotide kinases (PDB IDs: 1AKY and
1GKY) by different methods is shown. The RMSD (in bold) and the number of aligned residues (as reported by the tool) are also given.
Panel labels (RMSD in Å/number of aligned residues): iPBA: 2.33/124; CE: 4.00/151; DALI: 3.70/147; TM-align: 3.43/152; GANGSTA+: 2.93/116; ALADYN: 3.30/113.



Table 1. Comparison of iPBA with different structural alignment tools (web services)

Family/Fold | PDB chains | iPBA | CE | DALI | TM-align | FATCAT
----------- | ---------- | ---- | -- | ---- | -------- | ------
Cyclin (all α) | 1VINa,1JKWa | 178 (2.15), 35.5 | 211 (3.30), 29.8 | 203 (3.30), 24.5 | 212 (3.32), 29.92 | 211 (2.96), 35.9
FAD linked oxidase (α+β) | 1DIIa,1I19a | 316 (2.51), 26.8 | 239 (3.20), 17.0 | 378 (3.60), 20.0 | 413 (4.17), 22.0 | 407 (3.07), 32.16
Nucleotide kinase (α/β) | 1AKYa,1GKYa | 124 (2.33), 25.8 | 151 (4.00), 17.0 | 147 (3.70), 18.3 | 152 (3.43), 23.5 | 144 (3.11), 29.4
Serine Protease Inhibitor (small) | 1CCVa,1COUa | 45 (2.32), 23.3 | 50 (3.10), 21.09 | 47 (3.00), 19.7 | 53 (3.03), 25.2 | 55 (3.19), 21.0
Plastocyanin (all β) | 2AZAa,1GY1a | 98 (1.87), 43.8 | 104 (2.70), 39.2 | 101 (2.60), 36.94 | 104 (2.50), 44.3 | 105 (2.81), 38.8
Asparagine Synthase (multi-domain) | 1JGTa,1CT9a | 388 (2.11), 41.4 | 16 (3.10), - | 429 (3.10), 35.6 | 436 (2.96), 39.6 | 433 (3.00), 40.4



Each protein pair was chosen at random from a different structural class (in parentheses) of the HOMSTRAD database (4). The number of aligned
residues (as defined by the different methods) is given, with the RMSD within parentheses. The GDT_TS score, calculated for distance thresholds
increasing in steps of 0.5 Å over the range 0.5-5 Å, is also shown in italics. The best and second best scores are highlighted in red and blue.
(-) reflects the incomplete output of the program, which limits the GDT_TS calculation. Rigid-body approaches have been tested with CE, DALI and
TM-align. Best RMSD and GDT_TS of the rigid-body approaches have been highlighted in bold.



sequence alignment and quality scores is also downloadable in text
format.

Implementation 

This tool is implemented mainly in C, Python and HTML, and also uses the
Jmol and PyMol programs. The front end is based on HTML and PHP.
Perl/CGI programs control the input, while Python and C programs carry
out the processing behind the database search and pairwise comparisons.
Direct visualization and manipulation of aligned structures is enabled
with a Jmol applet, and static images of superposed structures are
rendered in PyMol using its internal 'raytracer' option. Supplementary
Data S1 shows a schematic representation of the series of steps involved
in the iPBA web server.



DISCUSSION 

As shown in Figure 1, the web-based iPBA alignment tool is simple to
use. The user only needs to give the coordinates to mine SCOP
(Figure 1A) or for pairwise superimposition (Figure 1B). Outputs are
mainly given visually as sequence alignments and 3D structure
superimpositions (Figure 1D). Output alignment files can also be
downloaded for local use. The local alignment strategy also provides a
route to detect specific structural motifs in proteins.

The improvement in the alignment methodology and the use of specialized
PB substitution matrices have greatly enhanced the quality of alignments
and the mining efficiency. The PB-based alignment approach had already
shown an impressive performance as a structure comparison tool (24).
Supplementary Figure 2 highlights the gain in alignment quality with
respect to the earlier approach [PBALIGN, (24)]. One hundred randomly
chosen SCOP domain pairs sharing <40% sequence identity were used for
comparison. A total of 89% of the alignments have a better RMSD than
with PBALIGN (Supplementary Data S2). A comparison performed on a bigger
benchmark data set also suggested that a significant gain of 82% in
alignment quality could be achieved. The mining efficiency also improved
by 6.8% and the gain was largely uniform across different structural
classes.

To give a picture of the performance, the quality of alignments
generated by iPBA was compared with the output alignments of other
well-established tools, namely CE, DALI, FATCAT and TM-align
(7,18,38,39) (Table 1). For the full-length chains ('global' alignment
option), the alignments generated using iPBA have the lowest RMSD.
However, the number of aligned residues is also lower in many cases.
GDT_TS scores are more appropriate in such cases to give a better idea
of the alignment quality. As highlighted in Table 1, iPBA generates
alignments of very high quality. Compared with the non-flexible aligners
(CE, DALI and TM-align), iPBA alignments have the best quality scores in
the majority of cases. FATCAT produces flexible alignments and is
expected to give the best performance when flexible movements are
involved. This is true for the first three cases in Table 1, where iPBA
scores next to FATCAT. Thus the quality of iPBA alignments is largely
comparable. In a systematic comparison using the standalone version of
iPBA, the alignments were found to be better than those of DALI and
MUSTANG in >80% of the cases. To demonstrate this, we chose the data set
of 100 domain pairs from the SCOP database, sharing <40% sequence
identity. On this set of domain pairs, the alignments generated by iPBA
were compared to those obtained with DALI (38), MUSTANG (40), GANGSTA+
(41) and TM-align (39). A total of 93.2 and 95.1% of the alignments had
a better GDT_TS score compared to DALI and MUSTANG alignments,
respectively (Supplementary Data 3A and B). The quality of ~81.6% of the
alignments was better than with GANGSTA+, while the difference was less
striking when compared to TM-align. About 45% of the alignments had a
GDT_TS score lower than TM-align (Supplementary Data 3D); however, the
difference in scores for 80% of these cases was <3, reflecting a similar
alignment.

Figure 2 presents a view of the 3D alignments of two nucleotide kinase
structures with similar folds, obtained with iPBA and with different
non-flexible alignment approaches, namely DALI, CE, TM-align, GANGSTA+
and ALADYN. As highlighted (also see Table 1), the alignment quality is
better with iPBA. A closer look at the figure shows that iPBA gives a
more refined alignment, with the equivalent secondary structural
elements well fitted onto each other.

CONCLUSION 

The ability to represent the complete backbone conformation of a protein
chain as a series of letters, followed by the use of sequence alignment
techniques, is what mainly distinguishes iPBA from other structure
comparison tools. In terms of alignment quality and efficiency in
detecting structural relatives, iPBA has been quite successful among the
wide range of methods available (42). The local alignment option further
adds to the utility of this approach. The web tool also provides an
interface for the visualization and analysis of the alignments.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online.